Text File | 1993-03-22 | 29KB | 808 lines
Instant Index
Transform-Based Full Text Indexing and Search Software
Version 1.0
Documentation
Instant Index, Copyright (C) 1992 1993 Theodore A. Holden,
All rights Reserved
LICENSE
Instant Index v1.0 is neither free software nor is it in the
public domain. The software and its documentation, this file, are
property of the author and may not be sold without written
permission from the author.
Instant Index v1.0 is distributed as shareware. This means that
you are granted a limited license to use it for a period of 30
days. If you find it useful and decide to continue using it after
the trial period, registration is required.
Registered Individual users will be granted a just-like-a-book license
which means a registered version of the software can be used by more
than one person and can be moved from one computer to another so
long as there is NO POSSIBILITY of it being used by two different
persons on two different computers at the same time, just like a
book can not be read by two different persons in two different
places at the same time. This is mainly intended to allow the
typical individual user to use the product on a computer at home and
on a computer at the office.
Two individually licensed copies of Instant Index, of any version,
with the same serial number, may not legally appear on more than one
computer at any place of business, government agency, school,
etc.
Version 2.0 of Instant Index is a commercial product and is not
shareware.
Commercial site licenses for all versions of Instant Index are
available at reasonable rates.
TERMS OF DISTRIBUTION :
Redistribution of version 1.0 of Instant Index must include
the software, its documentation file, order form and all supplemental
files as a single unit without any modification AND subject to the
following conditions:
1. Any individual is welcome to make copies for his/her friends
and/or colleagues if NO FEE is charged.
2. Electronic bulletin boards, whether or not they charge
their users a subscription fee, are welcome to post the
program for downloading as long as they do not charge any fee
specifically for the distribution of Instant Index.
3. Computer information services such as CompuServe (CIS), Genie,
etc., may post this software for their subscribers.
4. Non-commercial user groups and computer clubs may distribute
the program to their members if the fee charged for the
diskette containing Instant Index does not exceed $10.
5. Disk vendors approved by the Association of Shareware
Professionals or disk vendors who explain the concept of
shareware in their ads that quote a price may distribute the
shareware version of Instant Index.
6. Persons or enterprises wishing to distribute Instant Index
in combination with other hardware, software, books or materials
must obtain proper licensing agreements from HT Enterprises.
DISCLAIMER OF WARRANTY
THIS SOFTWARE AND MANUAL ARE SUPPLIED "AS IS". THE AUTHOR HEREBY
DISCLAIMS ALL WARRANTIES RELATING TO THIS SOFTWARE AND ITS
DOCUMENTATION FILE, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED
TO DAMAGE TO HARDWARE, SOFTWARE AND/OR DATA FROM USE OF THIS
PRODUCT. IN NO EVENT WILL THE AUTHOR OF THIS SOFTWARE BE LIABLE
TO YOU OR ANY OTHER PARTY FOR ANY DAMAGES. THE WORST POSSIBLE
CASE FOR SOFTWARE FAILURE, IN OUR VIEW, WOULD BE FOR THE
COMPUTER INVOLVED, THE HOUSE OR BUILDING IN WHICH IT IS LOCATED,
AND THE ENTIRE NEIGHBORHOOD CONTAINING THAT BUILDING TO BURN TO
THE GROUND DUE TO SOME UNFORESEEN SOFTWARE BUG; EVEN IN THAT
CASE, NEITHER THEODORE HOLDEN NOR HT ENTERPRISES WILL ACCEPT ANY
LIABILITY.
HT Enterprises cannot and will not be liable for any special,
incidental, consequential, indirect or similar damages due to loss
of data or any other reason, even if HT Enterprises or an authorized
HT Enterprises agent has been advised of the possibility of such
damages. In no event shall the liability for any damages ever
exceed the price paid for the license to use software, regardless of
the form and/or extent of the claim. The user of this program bears
all risk as to the quality and performance of the software.
YOUR USE OF THIS SOFTWARE INDICATES THAT YOU HAVE READ AND AGREE TO
THESE AND OTHER TERMS INCLUDED IN THIS DOCUMENTATION FILE.
VERSIONS AVAILABLE
Version 1.0 of Instant Index (included) is a shareware version which
handles one (presumed large) ascii text file, with a .TXT suffix, at a
time. This version is more than proof of concept; it should actually be
of more value than the full multi-file version to certain groups of
users, particularly CD ROM vendors and others involved in distributing large
ascii text files. Naturally, licensing arrangements must be made with HT
Enterprises and the author of this software for such use.
Version 2.0 uses the same indexing technology to index and search entire
directories, and all of the files in them. Text in ascii files is
shown via the text handling mechanism of Instant Index itself;
application files are brought up either in this method or in the
applications which created them. Provision is made for applications
which do not use file extensions. This version serves the needs of the
user who has lots of text in lots of files and requires being able to
very quickly find the files which contain a certain text pattern or
group of words.
WHAT IT IS
Instant Index requires a 386 or 486 computer and MS Windows
3.1. It does not run in any other environment as of yet.
Instant Index represents a software genre which most will be less
familiar with than they are with the usual spreadsheets and
wordprocessors.
This genre is called full text search, and involves indexing large to
gigantic bodies of text on disk in such a way as to allow large scale
and rapid searching for words, phrases, and combinations of
words in proximity etc. Large bodies of text are just now becoming
increasingly common and available in DOS format, particularly with the
proliferation of CD technology. A really good program for handling
large bodies of text is clearly needed.
There are two reasons why the average PC user is not familiar with this
software genre:
1. Until now, such software has been very expensive. License fees of
$1000 to $20,000 for a single user computer have been the norm.
2. Until now, such software has been very slow; recent articles
in PC Week and InfoWorld describe leading products taking upwards
of two hours to index text files ranging from 13 - 26 MB. The
average PC user would have (justifiable) difficulties in dealing
with this psychologically. Basically, anything which takes two
hours or more to happen on a 486 isn't really a solution to
anything; it's a problem.
The HT Enterprises Instant Index program solves both
problems. It is priced well within the reach of the average PC user
and is FAST. II can index a 100 MB file in under 20 minutes
on a typical 33 MHz 486 PC running MS Windows. It is something like 100
times faster than the fastest products available until now.
We don't really know how large a file you could use with Instant
Index on ordinary 386/486 PC's; we suspect it could handle files in
the .6 GB to GB range.
Normal use for II would be to find a certain section
of text and paste it into a wordprocessing document in Ami Pro,
WordPerfect, or some other full-function Windows wordprocessor.
REGISTRATION FOR HT ENTERPRISES PROGRAMS
II version 1.0 is intended as a home product and also as a means for businesses,
corporations and the like to evaluate the features and performance of
the Instant Index concept. There is also a certain class of users which
might find version 1 or some adaptation of it more useful than the full
version (2), and for such applications, licensing arrangements must
be made with HT Enterprises.
THE VAST BULK OF USERS WILL HAVE A GREAT DEAL MORE USE FOR VERSION 2.0,
AND IT IS NOT TERRIBLY EXPENSIVE!
Good site license terms for Instant Index are available. No version of
II may be used in businesses, organizations, corporations,
schools, government agencies etc. for production work without proper
licenses being in place.
Home computer users may use II version 1.0 for one month on a demo basis.
Beyond that, however, registration is required for continued use of
II. The included form should be used to register a copy of II.
Registered users of Instant Index (any version) receive technical
support & news of upgrades and new products, which in the future will
include other AI applications. If you haven't guessed already, II is an
AI application.
.............................................................................
REGISTRATION FORM For Individual Software Licenses
PROGRAM:                           # COPIES:     AMOUNT:
II Version 2.0 ($200 per copy)     _________     $______________
  (Intro price good thru 5/30/93)
II Version 1.0 ( $30 per copy)     _________     $______________
TOTAL. . . . . . . . . . . . . . . . . . . . .   $______________
PAYMENT BY:
Check/Money Order No.__________ enclosed for $____________________
MAILING ADDRESS:
NAME______________________________________________________________
ADDRESS LINE 1____________________________________________________
ADDRESS LINE 2____________________________________________________
CITY/STATE/PROVINCE_______________________________________________
COUNTRY/POSTAL CODE_______________________________________________
HOME PHONE________________________________________________________
OFFICE PHONE______________________________________________________
SEND TO: HT Enterprises
8375 Leesburg Pike, Suite 422
Vienna Va. 22182
Call HT Enterprises at (703) 760-9713 for site license pricing.
INSTANT INDEX
By
HT Enterprises
ASSUMPTIONS
Instant Index is a piece of software designed for handling
large to gigantic text data files. Instant Index runs under
Microsoft Windows 3.1 and assumes at least a 386/486 based
computer with a minimum of 4 MB of memory and a mouse pointing
device. Instant Index assumes an ASCII text file with a .txt
extension, and creates a corresponding .con (control) index
file. Aside from the program itself, Instant Index must keep
one of these index files in memory (Windows 3.1 swap space on
disk is included as memory in this reckoning) while searching
the .txt file. The index files are typically around 6% the size
of the original data file. This means that a 100 MB file could
be searched easily enough with a 486 computer with 8 MB RAM
memory. 486 Computers are now being configured with 64 MB of
RAM; this means that the outer limits of size for text files
for use with Instant Index should be around 800 MB or so.
Bottom line: Instant Index absorbs around 400K bytes when
loaded with a minimal sized control file. You need that 400K
plus enough space for your index file.
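
The arithmetic above can be sketched in a few lines. This is only a
rough estimate built from the figures quoted in this section (index
about 6% of the text file, about 400K for the program); the function
names are made up for illustration and are not part of Instant Index.

```python
# Rough memory estimate from the figures quoted in this section.
INDEX_RATIO = 0.06          # .con index is typically about 6% of the .txt file
PROGRAM_OVERHEAD_KB = 400   # program loaded with a minimal control file


def index_size_mb(text_mb):
    """Approximate size of the .con file for a text file of text_mb megabytes."""
    return text_mb * INDEX_RATIO


def memory_needed_mb(text_mb):
    """Program overhead plus the index, which must be held in memory."""
    return PROGRAM_OVERHEAD_KB / 1024 + index_size_mb(text_mb)
```

For a 100 MB text file this gives an index of about 6 MB and a little
over 6 MB in total, which is why 8 MB of RAM (plus swap space) is
described as sufficient.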
It is assumed that users are familiar with DOS files and
directories, normal copy commands etc., and with the workings of
MS Windows, ordinary file and font dialog boxes etc.
SETUP
You paid lots of money for Instant Index; therefore, it should
be time-consuming and difficult to install on your computer,
right? Sorry to disappoint you. You'll find two executables on
the distribution diskette: II.exe (the main program), and
wtxt.exe, which is the program which creates indices. II.exe
calls wtxt.exe with a WinExec call, which means that wtxt.exe
has to be in a directory which is on your path. II.exe could be
anywhere. You simply go through the MS Windows process for
adding an executable to one of the normal program groups, which
would usually be the Windows Applications group.
I. What Instant Index is and isn't.
Instant Index is an awesomely fast system for indexing and searching
large to gigantic text files. It assumes a user has one or
more ascii text files with a .txt extension, and then creates
matching .con (control) files for indexing. The text files may
then be searched for words or combinations of words in settable
proximity, and text may then be pasted into typical MS Windows
word processing software using the Windows clipboard. Instant
Index is single-purpose; it does one thing and does that one
thing well.
II. Technical Basis
Typical text-search software generates tables of key-words
which hash into tables of linked lists of sector locations for a
data file. This methodology allows fast search once it is set
up for a particular data file, but setting it up is very time
consuming. Index files (the keyword tables and linked lists
etc.) tend to be not much smaller than the original data files,
which can be a problem with very large files.
Instant Index, on the other hand, utilizes statistical methods
and a variant of the Lawrence transform to achieve a very fast
correlation between textual content and location, and produces
index files which are typically 6 percent of the size of the
original data file. This system is more malleable than the
standard keyword hashing algorithms; a number of desirable
functions, such as actual fuzzy searching on very large data
sets, are natural fallouts of the technology. It is not easy to
imagine fuzzy searching on a file too large for memory using
keyword tables and hashing algorithms.
III. Speed and Power.
The standard test file which we've been working with at HTE is
the King James Bible, about 4.6 MB of text, and Instant Index
can index that in something like 30 seconds on a 33 MHz generic
486 with a 17 ms disk. This would allow a 100 MB file to be
indexed for rapid search in under 15 minutes on a computer
costing less than $2000. This sort of power and speed give a
user options which he otherwise simply would not have in dealing
with large text data sets. Text being scanned in or piling in
from a news feed, for instance, can now be dealt with in rapid
and easy fashion. The thought of re-indexing a large file which
has changed ceases to cause the fear and panic which it formerly
did.
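
The extrapolation from the Bible benchmark to a 100 MB file can be
checked with a short calculation. The sketch assumes indexing time
grows linearly with file size, which this document implies but does
not state; the function name is hypothetical.

```python
# Linear extrapolation from the benchmark quoted in this section.
BENCH_MB = 4.6        # King James Bible test file
BENCH_SECONDS = 30    # quoted indexing time on a 33 MHz 486


def estimated_index_seconds(text_mb):
    """Estimated indexing time, assuming time scales linearly with size."""
    return text_mb * BENCH_SECONDS / BENCH_MB


minutes_for_100_mb = estimated_index_seconds(100) / 60
```

Under that assumption a 100 MB file comes out at roughly 11 minutes,
consistent with the "under 15 minutes" figure.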
IV. Characteristics of Instant Index.
In contrast to normal software, Instant Index has some of the
same characteristics, the same strengths and, occasionally, a
few of the same kinds of quirks as the human mind and human
memory. There are two pieces to the Instant Index search
mechanism: the transform-based initial search engine, the
action of which is instantaneous in all cases, and a "grep" -
like secondary or clean-up function. Normal text search
software gets slow when given long search strings; Instant
Index gets faster. The more specific a search criterion you give
it, i.e. the longer the search string, the closer you come to
having the math-transform first-stage system do all of the work,
and the faster the whole process becomes.
For instance, the character string "lions" occurs in "millions"
and a number of other words; the fragment "ions" occurs even
more often. Therefore a search of the bible (our standard test
material) for "lions" returns several hundred hits, too many to
serve any useful purpose. Adding the string "Daniel", or
"mastery", however, narrows the search down to a few instances
in the book of Daniel, the response being nearly instantaneous.
Typical search phrases such as "behold, a pale horse", or
"fishers of men" , are plenty long enough in most instances to
return the one or two hits expected and nothing more. Words
such as "the" or "and" add nothing to a typical search for
obvious reasons, and may be omitted. Any word with an unusual
combination of letters, such as "archeologist" or
"paleontologist", or likewise any word with four or more
syllables, will often work well by itself as a search criteria.
When a search turns up too many hits to be useful, you can
always add another word to the end of the search string and try
again. Adding words always narrows the search down and speeds
things up.
At times, you have to be a little bit smart about how you use
any tool, and II is no exception. For instance, searching
Shakespeare's works for a famous phrase, such as Hamlet's "To be
or not to be, that is the question!", turns out to be very slow
on II. The truth of the matter is, that the only word in that
whole phrase with any power of discrimination within the context
of English text, is the word "question". A search for "Whether
'tis nobler in the mind" turns out to be quite fast and is, for
fairly obvious reasons, a better use of the tool.
V. Verify and Redline.
The Verify and Redline functions (menu keys) affect the actions
of a search. Verify is the "grep" - like, or ordinary search
function which cleans up after the action of the statistical
engine of Instant Index. Anytime a search returns more slowly
than instantaneously, Verify is at work. Verify removes false
hits, or the tiny amount of statistical aliasing produced by the
statistical engine of Instant Index. If you turn Verify off, II
(Instant Index) becomes instantaneous in all cases, but you'll
find yourself having to give longer and more precise search
strings to cut the number of hits down to acceptability. The
normal situation in which you turn verify off is for fuzzy
matching applications in which you assume data produced from
scanning and OCR is less than 100% good on spelling. In that
case, Verify would always fail upon encountering a misspelled
word, and would prevent the entire process from working.
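
The division of labor between the statistical first stage and Verify
can be illustrated with a small sketch. The transform actually used by
Instant Index is not described in this document, so the sketch
substitutes a different and much cruder technique: a hashed
three-letter-fragment signature per sector. All names and the
signature width are assumptions made for illustration only.

```python
import zlib

SECTOR_SIZE = 2048      # default section size from the Parameters dialog
SIGNATURE_BITS = 4096   # hypothetical signature width, for illustration


def trigrams(s):
    """All three-letter fragments of s, lower-cased (searches ignore case)."""
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}


def signature(s):
    """Bit mask with one hashed bit set per fragment found in s."""
    bits = 0
    for g in trigrams(s):
        bits |= 1 << (zlib.crc32(g.encode()) % SIGNATURE_BITS)
    return bits


def build_index(text):
    """Cut the text into sectors and compute one signature per sector."""
    sectors = [text[i:i + SECTOR_SIZE] for i in range(0, len(text), SECTOR_SIZE)]
    return sectors, [signature(s) for s in sectors]


def search(sectors, sigs, query, verify=True):
    """Return the numbers of sectors that appear to contain every query word."""
    words = query.lower().split()
    want = 0
    for w in words:
        want |= signature(w)
    # Stage 1: fast signature test. May return false hits (aliasing).
    hits = [n for n, sig in enumerate(sigs) if (sig & want) == want]
    # Stage 2: the grep-like Verify pass removes the false hits.
    if verify:
        hits = [n for n in hits if all(w in sectors[n].lower() for w in words)]
    return hits
```

With verify=False only the signature test runs, so the answer is
immediate but may include sectors that merely share letter fragments
with the query. A misspelled word in the text makes the exact Verify
comparison fail, which is why Verify is turned off for fuzzy work.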
Redline highlights the section of text which you are looking
for when a hit is returned to the screen. Redline has two
modes: 2-Lines and All. 2-Lines highlights the text you are
searching for only when it occurs within two successive lines of
text, which is normal for a phrase. The All option causes
highlighting to occur for any line containing any word within
the search criteria. When using this, you must leave out words
such as "and", "the", "a" etc. or every line on the page will
be lit up. The All option is good when searching for a few key
words which may be assumed to lie in close proximity, but not
necessarily on the same one or two lines, in a particular
section of a large data file.
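
One possible reading of the two Redline modes is sketched below. The
exact rules Instant Index applies are not spelled out here, so the
treatment of a phrase that wraps from one line to the next is an
assumption, and the function name is hypothetical.

```python
def redline(lines, query, mode="2-lines"):
    """Return the set of 0-based line numbers to highlight."""
    q = query.lower()
    marked = set()
    if mode == "all":
        # All: highlight any line containing any word of the search string.
        words = q.split()
        for n, line in enumerate(lines):
            low = line.lower()
            if any(w in low for w in words):
                marked.add(n)
        return marked
    # 2-Lines: highlight the phrase when it sits on one line, or when it
    # wraps from the end of one line onto the start of the next.
    for n, line in enumerate(lines):
        low = line.lower()
        if q in low:
            marked.add(n)
        elif n + 1 < len(lines):
            nxt = lines[n + 1].lower()
            joined = low.rstrip() + " " + nxt.lstrip()
            if q in joined and q not in nxt:
                marked.update((n, n + 1))
    return marked
```

In the All mode a query containing "the" or "and" would mark nearly
every line, which is the reason for leaving such words out.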
VI. Motion Control: Next Hit, Previous Hit, Forward, Back,
scrollbar
The parameters dialog box for the indexing function of Instant
Index allows you to set a data file sector size (not the same
notion as disk sector size) for searching. Searches then seek
sectors which contain all of the words in a search criteria.
If, for instance, five such sectors are found, the search will
come back with a message box claiming <5 HITS!>. The first hit
sector will be put up on the screen, or at least as much of that
sector as the screen will hold. Next Hit and Previous Hit move
to the next or previous hit sector. Forward and Back position
the file forward or back 512 bytes at a time.
The scroll bar included in Instant Index positions the view
screen within the text file and indicates, more or less, where
in the text file a search string has been found.
VII. Minimizing aliasing and false hits.
The advantages of Instant Index in comparison with standard
text search software are huge. The only very minor down side is
aliasing or false hits, which comes with the territory with a
statistical methodology, and this can easily be controlled. The
statistical back-end engine returns all sectors in which all
words in a search criteria occur. Making the search string
longer and more precise always narrows the search down and
speeds up the process, since it cuts down the amount of work
required of the verify function.
VIII. Double Hits.
Instant Index occasionally returns a double hit, i.e. returns
the same hit twice. This is a very minor nuisance which is
unavoidable in the design of such a package. It is a by-product
of the system for ensuring that search strings which span two
file sectors still get reported without losing performance or
increasing index file size.
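
How a boundary-spanning safeguard can produce double hits is easy to
see in a toy model. The mechanism Instant Index actually uses is not
described in this document; the sketch below simply assumes that
neighboring sectors overlap by a few bytes, and every name and number
in it is hypothetical.

```python
def overlapping_sectors(text, size, overlap):
    """Cut text into sectors of `size` bytes; neighbors share `overlap` bytes."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be smaller than the sector size")
    step = size - overlap
    return [(start, text[start:start + size]) for start in range(0, len(text), step)]


def find_hits(text, phrase, size=64, overlap=16):
    """Return the start offsets of all sectors that contain the phrase."""
    return [start for start, chunk in overlapping_sectors(text, size, overlap)
            if phrase in chunk]
```

With no overlap, a phrase that straddles a sector boundary is missed
entirely. With overlap it is found, but a phrase lying wholly inside
the shared region is reported by both neighboring sectors, which is
one way a double hit can arise.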
IX. Open and Fonts
The Open function assumes that a .txt file and a corresponding
.con file exist in a directory somewhere, i.e. that you have
availed yourself of the Index function to create a .con file for
your .txt file. Other than that, Open is just an ordinary
Borland FileDialog box.
Fonts is a fairly standard font select dialog box. If you
haven't seen these before, clicking on ".." is equivalent to "CD
.." under DOS or UNIX. There's nothing else mystical about them.
X. Redlining and Copy/Paste
Aside from lines which get redlined by the Verify function, you
can hold the left mouse key down and redline any lines which
appear on the screen. Clicking the right mouse key undoes any
redlining. The Copy/Paste key puts any redlined text into the
MS Windows clipboard edit buffer, from which it may be retrieved
using the "Paste" feature of any full-function MS Windows word
processor. This is the normal use of Instant Index. Basically,
you find something you want in a huge text file, then you paste
it into a word processor and do your own thing with it.
XI. Fuzzy logic and wildcard-like searching
Including the fragment "direct" in a search criteria will
return "director", "direction", "directing" etc. etc. i.e.
wildcard searching is achieved by simple shortening or omission.
Fuzzy searching is another topic. We believe we have done the
best job that can be done with fuzzy searching in Instant
Index, and it is not obvious that fuzzy searching could be achieved
at all for a file too large for memory using traditional
methods. Fuzzy searching means being able to find text which
might be misspelled. Bottom line is that the best you could
ever hope for is finding some percentage over 50% of such
criteria. We believe we're way over 50%, but anything more
precise than that would be a wild guess.
Fuzzy logic is an overused concept, like the word "turbo".
Your best procedure for dealing with scanned text or other text
prone to misspellings, if you have this option, would be to run
the text through some serious spell-checker and then use Instant
Index on it. The guys who write spell-checkers are like us;
they're good at what they do.
For a very large scanned text file, this may not be possible.
Read through the section on the parameters dialog box for the
Index function so that you know what goes into preparing a file
for fuzzy searching. Basically, when you create an index for a
file which you plan to do fuzzy searching on, you want to set
the Search Density parameter as high as possible, allowing for the
fact that the index file must be kept in memory. The Fuzzy
Value dialog box allows you to set values of 0 (no fuzziness), 1
(one letter missed in a search criteria), or 2 (two letters
missed in a search criteria). Beyond that only prayer would
help.
For fuzzy searching, set Verify to OFF and Redline to ALL.
Fuzzy searching raises the rate of statistical aliasing. You
have to know something about what you're looking for.
Basically, you just keep adding words to the end of the search
string (in the Search dialog box), until the number of hits is
down to something acceptable.
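
The meaning of the Fuzzy Value settings can be illustrated with a
brute-force matcher. This is not how Instant Index does it (its fuzzy
matching runs inside the statistical engine); the sketch assumes that
"one letter missed" means one wrong letter in an otherwise aligned
word, as is common with OCR errors, and the function name is made up.

```python
def fuzzy_find(text, word, fuzzy_value=1):
    """Return offsets where `word` matches with at most `fuzzy_value` wrong letters."""
    text, word = text.lower(), word.lower()
    n = len(word)
    hits = []
    for i in range(len(text) - n + 1):
        # Count the positions at which this window differs from the word.
        mismatches = sum(a != b for a, b in zip(text[i:i + n], word))
        if mismatches <= fuzzy_value:
            hits.append(i)
    return hits
```

A Fuzzy Value of 0 behaves like an exact search; raising it admits
more misspellings but also more unrelated matches, which is the extra
aliasing mentioned above.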
XII. Search.
The Search function brings up an ordinary edit dialog box in
which you type a couple of words or a phrase to search for. The
text you typed in remains after the search. You can add a word
or two (to narrow down the search) simply by adding after the
end of a string already in the dialog box.
XIII. Indexing.
The Index function executes the wtxt.exe program mentioned in
the section on setup. Wtxt.exe is another MS Windows program,
and may be thought of as simply a non-modal dialog box or
extraneous window; that's precisely what it appears as. It has
two functions: Create and Parameters.
XIV. Create.
The Create function is an ordinary Borland file dialog box.
You use it to select a file with a .txt extension and create a
corresponding index (.con) file for it. After that, you can
either leave wtxt.exe on the screen, possibly to create several
.con files in one sitting, or close it.
XV. Parameters.
The Parameters function in wtxt.exe allows you to set a number
of parameters which figure into creating index files:
A. Alphabet:
The upper and lower case characters of the alphabet being used
for searching. This could be anything for which an MS Windows
font exists. There is no reason why German or Russian text
or even something as strange as French text could not be
searched. Instant Index is not case sensitive. Be sure that
upper and lower cases include equal numbers of characters. We
assume a phonetic alphabet, left-to-right, high-to-low, all of
those sorts of things.
A plug for another of our products may in fact be in order here.
We have one of the most interesting Russian font sets in
existence, including standard Cyrillic, a fairytale font, and a
Russian version of a Cloister font in ATM format. Call for
info.
B. Other Characters:
Other characters (than the alphabet) to include in search
strings. Typically, just the numbers 0 - 9. For instance, for
bible searching, you might also include a colon ( : ) to allow
you to search for such things as "Gen 1:7". Instant Index
allows a total of 60 characters all told, counting each
upper/lower-case pair as one character.
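
The two rules in items A and B (equal-sized upper and lower case sets,
and a total of 60 characters with each case pair counted once) can be
checked as follows. This is a sketch of the rules as stated, not of
the Parameters dialog itself; the function name is hypothetical.

```python
import string

MAX_CHARACTERS = 60   # limit quoted above, counting each case pair once


def check_character_set(upper, lower, others):
    """Validate an alphabet plus extra characters; return the count used."""
    if len(upper) != len(lower):
        raise ValueError("upper and lower case sets must be the same size")
    total = len(upper) + len(others)   # each upper/lower pair counts as one
    if total > MAX_CHARACTERS:
        raise ValueError("too many characters: %d > %d" % (total, MAX_CHARACTERS))
    return total


# English alphabet, the digits 0 - 9 and a colon for references like "Gen 1:7".
used = check_character_set(string.ascii_uppercase, string.ascii_lowercase,
                           string.digits + ":")
```

That configuration uses 37 of the 60 available characters.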
C. Search Density.
Basically, this is just the size of the index file. Raising
this value by one doubles the size of the index file from the
previous value. The up side is that this reduces statistical
aliasing. This becomes helpful for fuzzy searching. Assuming
somebody doing fuzzy searching has the memory to deal with it,
the larger index file is better.
D. Text File Section Size.
This is the size (in bytes) of a section within the text file to
serve as a base of reference. Instant Index thinks of the text
file as consisting of sectors of this unit of size. The
back-end statistical search engine returns sectors within which
all words of a search criteria are found. 2048 Bytes is the
default. As of now, we can think of no real reason for having
the sector size smaller in the normal case. For fuzzy logic
searching with Verify off, a lower value would let you see an
entire file sector on one screen, which might be helpful.
Halving the section size doubles the size of the index file.
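
The two sizing rules in items C and D combine into a simple scaling
formula. The sketch measures Search Density in steps relative to its
default, since the default value itself is not given here, and the
function name is hypothetical.

```python
DEFAULT_SECTION_BYTES = 2048   # default Text File Section Size


def relative_index_size(density_steps=0, section_bytes=DEFAULT_SECTION_BYTES):
    """Index size as a multiple of the size at default settings.

    density_steps: how far Search Density is raised above its default
                   (each step doubles the index).
    section_bytes: Text File Section Size (halving it doubles the index).
    """
    return 2 ** density_steps * DEFAULT_SECTION_BYTES / section_bytes
```

For example, raising Search Density by two steps and cutting the
section size to 512 bytes multiplies the index size by 16, so the
memory cost of tuning for fuzzy searching grows quickly.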
E. Anti Aliasing.
This one is a no-brainer. Anti aliasing is set for the English
language at present. For English text, leave it on. For other
language text, turn it off. The feature is worth having, as it
generally reduces the incidence of false hits and speeds up the
program (reduces the job of the Verify function). We would
require 5 - 10 MB of text in another language, along with an
appropriate MS Windows font, to set up a version with
Anti-aliasing for another language.
The unique anti-aliasing feature of Instant Index is the chief
point which differentiates this program from other attempts to
use the Lawrence transform, and what allows the program to use
an index file 6% the size of the data file rather than the more
usual 20 - 50%.